Abstract: This paper discusses the performance comparison of different algorithms for classification, estimation and filtering problems. Two information theoretic measures, namely, the empirical mutual information and the asymptotic information rate, are proposed for simulation based performance evaluation and algorithm comparison. They can be used as a guideline for designing a practical procedure to measure the performance of different algorithms with limited computational resources. Other useful performance measures are reviewed and their relation to the two new measures is discussed. Several practical examples are used to provide some insights on the inherent difficulty of algorithm ranking and the advantage of using the information theoretic measures for algorithm comparison.

Keywords: Performance evaluation, information theoretic measure, detection, estimation, filtering.

1  Introduction 

Performance evaluation aims to study the behavior of a system operated by various algorithms and to compare their pros and cons based on a set of measures or metrics, each of which usually maps different algorithms into different real values or partial orders for ranking. In practical applications, as the system being studied becomes more complex, the analytical results regarding the performance of different algorithms with respect to a particular measure usually do not have closed forms or are computationally intractable. Thus simulation based performance evaluation serves as an indispensable tool to measure the performance of various algorithms. On the other hand, there are several distinctive issues on how to develop a good procedure to collect and disseminate information from the system relevant to performance aspects. Good measures to reliably rank different algorithms, as well as an assessment of the credibility of the ranking, can also guide algorithm development, especially with limited computational resources. The issues related to performance evaluation metrics and algorithm development to optimize them are often interrelated and demand interdisciplinary research on system design, data analysis, simulation methods, and statistical inference. In this paper, we do not intend to address the design issues for performance evaluation. Our major focus is on the performance evaluation of classification, estimation and filtering problems. The algorithms we want to compare are the so-called classifiers, estimators, filters or a combination of the above for joint problems. A natural measure is the classification error, estimation error or filtering error, which requires some clarification for different types of problems. With this setting, one also seeks to develop an algorithm that achieves the minimum error under a given performance measure. Different error measures will result in different solutions, each of which corresponds to the minimizer of a particular error measure. However, there is no consensus on which measure is particularly good for algorithm comparison. One often has to evaluate several measures and display all of the ranking results for a practitioner to make a decision based on a certain weighted combination of different measures in a common metric space.

We want to make a distinction between the algorithm developer (AD) and the performance evaluator (PE) since they may use different measures for different purposes. Furthermore, we assume that the problem being studied has a probabilistic structure and each algorithm provides the statistical inference of this probabilistic model so that the PE will gain a certain amount of information by running an algorithm. With this problem formulation, we propose two information theoretic measures for ranking different algorithms. One measure is called the empirical mutual information between the PE and the AD, which depends on the size of the test data. To have an information theoretic measure which is independent of the data length, we propose another measure called the asymptotic information rate, which characterizes the performance improvement of an algorithm with large data size for performance evaluation. These two measures are useful and complementary to some standard and existing error measures, some of which may not be suitable for the performance evaluation of joint classification, estimation, and filtering problems. We relate the two information theoretic measures to classical hypothesis testing, classification, quantization and estimation problems and provide insights on their usage in some nonstandard settings. Finally, we give several examples to show the usefulness of the information theoretic measures for algorithm comparison.

2  Performance Comparison Using Mutual Information Based Measure

In this section, we formulate performance evaluation as a statistical inference problem where the PE holds certain prior knowledge of the system parameters and the algorithm being evaluated provides additional information to reduce the uncertainty of those parameters. The mutual information quantifies how much the PE can gain by running a particular algorithm.

The problems being considered include hypothesis testing, classification, parameter estimation and filtering. We assume that the PE can generate data repeatedly based on the same or different probabilistic structures to test different algorithms with a predefined measure. The underlying truth that governs the data generation can be (1) a statistical hypothesis among many predetermined ones or (2) a value in the parameter space or (3) a time function from the realization of a random process. The algorithm is supposed to choose a hypothesis, produce an estimate of the unknown parameter or a time function of the unknown process based on the data generated by the PE. The purpose of the PE is to rank different algorithms using a set of measures. A fundamental question is what measures the PE tends to adopt. The measures have to be applicable to all the aforementioned problems and possibly joint ones.

To begin with, we consider a classification problem where the PE has $K$ hypotheses denoted as $\{H_1, \ldots, H_K\}$. Each hypothesis can generate a data sequence $\{z_1, z_2, \ldots\}$ with a certain mechanism conditioned on that particular hypothesis being true. The data sequence is used by the algorithm to determine which hypothesis governs the underlying data generation mechanism. A standard measure is the classification error for each hypothesis. However, one cannot directly compare the performance of two algorithms by looking at two arrays of the classification errors for all hypotheses. Of course one can use the weighted classification error as a measure and those weights can be derived by minimizing a Bayesian risk function if one has the prior probability of each hypothesis being used to generate the data [9]. In our formulation, these prior probabilities are determined by the PE and they are inherently subjective. However, the choice of probabilities also represents the PE's belief on which hypothesis is true without evaluating any algorithm.

Denote by $X$ the discrete random variable with probability mass function (pmf) $P(X)$ given by

$\Pr(X = H_i) = p_i, \quad i = 1, \ldots, K$  (1)

For a data sequence $\{z_1, \ldots, z_N\}$ of length $N$, denote by $X_N$ the discrete random variable with pmf indicating the probability that a particular hypothesis is chosen by the classifier based on the data sequence of length $N$. The probability that hypothesis $i$ is true conditioned on hypothesis $j$ being chosen by the classifier is denoted by

$\Pr(X = H_i \mid X_N = H_j) = q_{ij}, \quad i = 1, \ldots, K; \; j = 1, \ldots, K$  (2)

The entropy of $X$ is

$H(X) = -\sum_{i=1}^{K} p_i \log p_i$  (3)

The conditional entropy of $X$ given $X_N$ is

$H(X \mid X_N) = -\sum_{i=1}^{K} \sum_{j=1}^{K} q_{ij} \log q_{ij}$  (4)

The mutual information between $X$ and $X_N$ is defined as

$I(X; X_N) = H(X) - H(X \mid X_N)$  (5)

It quantifies the reduction of uncertainty about which hypothesis is true based on the classification results. If the classifier always chooses the correct hypothesis, then the mutual information achieves its maximum $H(X)$. Note that if the classifier always chooses the incorrect hypothesis when testing two hypotheses, we still get the maximal mutual information and one can easily modify the decision of this classifier to achieve zero error. Thus the mutual information is a good indication of the classification performance. If one wants to maximize the mutual information over the distribution of $X$, then a uniform distribution among the $K$ hypotheses should be used to generate the data sequences, i.e., all classes have equal prior probabilities [6].
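As a minimal sketch of how (3)-(5) can be evaluated from test results, the following Python snippet estimates the mutual information between the true class and the classifier decision from a table of counts. The function name `mutual_information_bits` is hypothetical and base-2 logarithms are an assumption, since the text does not fix the base. The computation uses the standard joint-distribution form of mutual information, in which the conditional entropy for each decision is weighted by the probability of that decision. The counts are those reported for classifiers A, B and C in Example 1 of Section 4.

```python
import numpy as np

def mutual_information_bits(counts):
    """Mutual information in bits between the row variable (truth)
    and the column variable (decision) of a table of counts."""
    joint = np.asarray(counts, dtype=float)
    joint = joint / joint.sum()                  # empirical joint pmf
    p_true = joint.sum(axis=1, keepdims=True)    # marginal of the true hypothesis
    p_dec = joint.sum(axis=0, keepdims=True)     # marginal of the decision
    indep = p_true @ p_dec                       # product of the two marginals
    mask = joint > 0                             # 0 * log 0 is treated as 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

# Rows: true class t = 0, 1; columns: decision y = 0, 1 (100 test samples each).
classifier_a = [[90, 0], [10, 0]]
classifier_b = [[80, 10], [0, 10]]
classifier_c = [[78, 12], [0, 10]]

mi_a = mutual_information_bits(classifier_a)
mi_b = mutual_information_bits(classifier_b)
mi_c = mutual_information_bits(classifier_c)
print(mi_a, mi_b, mi_c)
```

Under these assumptions classifier A yields zero mutual information, while B yields roughly 0.27 bits and C roughly 0.25 bits, which is in line with the ranking discussed in Example 1.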

Another widely used performance measure to rank classifiers is the classification error. However, this measure can be misleading as illustrated in [14] (p. 532). To take advantage of both mutual information and classification accuracy in ranking the classifiers, one may consider a new measure, namely, the normalized error to information ratio, given by


Bn 

OiN - ^ - 

I{X-,Xn) 


(6) 


where $\epsilon_N$ is the classification error and $\alpha_N$ is a pre-specified parameter based on the PE's preference between having a better classification accuracy and gaining more information. The best classifier should minimize the above measure. Note that when no error is made by the classifier, the mutual information gained by the performance evaluator becomes irrelevant. This is reasonable since the PE will gain the maximum information when the classifier has zero error no matter what prior distribution among the $K$ hypotheses the PE uses.

Next, we consider a parameter estimation problem where $\theta$ is unknown and to be estimated in a parameter space $\Theta$. We use a vector $z_N$ to denote the observed data sequence $\{z_1, \ldots, z_N\}$ of length $N$. The estimator uses $z_N$ to estimate $\theta$ and provides the estimate $\hat{\theta}_N$. We assume that the estimate is in another space $\hat{\Theta}$. The performance evaluator has prior uncertainty about $\theta$ which is characterized by the probability density function (pdf) $f(\theta)$. The differential entropy of $\theta$ is


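$h(\theta) = -\int_{\Theta} f(\theta) \log f(\theta) \, d\theta$  (7)

The differential entropy of $\theta$ given $\hat{\theta}_N$ is

$h(\theta \mid \hat{\theta}_N) = -\int_{\Theta} f(\theta \mid \hat{\theta}_N) \log f(\theta \mid \hat{\theta}_N) \, d\theta$  (8)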

In practice, the PE can only concentrate on a parameter space of finite support, especially when the likelihood function $\Lambda(z_N \mid \theta)$ does not have a parametric form. For convenience, we assume that the prior pdf is proper and the above differential entropies always exist. In this case, the mutual information between $\theta$ and $\hat{\theta}_N$ is defined as

$I(\theta; \hat{\theta}_N) = h(\theta) - h(\theta \mid \hat{\theta}_N)$  (9)
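As a worked instance of (7)-(9) under illustrative assumptions that are not part of the original text, suppose the prior is $\theta \sim \mathcal{N}(0, \sigma_\theta^2)$, the observations are $z_i = \theta + w_i$ with independent $w_i \sim \mathcal{N}(0, \sigma^2)$, and the estimator reports the posterior mean, which is a one-to-one function of the sample mean and hence a sufficient statistic. With natural logarithms the entropies then have closed forms:

```latex
\begin{align*}
h(\theta) &= \tfrac{1}{2}\log\!\left(2\pi e\,\sigma_\theta^2\right),\\
h(\theta \mid \hat{\theta}_N) &= \tfrac{1}{2}\log\!\left(2\pi e\,
   \frac{\sigma_\theta^2\,\sigma^2}{N\sigma_\theta^2+\sigma^2}\right),\\
I(\theta;\hat{\theta}_N) &= h(\theta)-h(\theta \mid \hat{\theta}_N)
   = \tfrac{1}{2}\log\!\left(1+\frac{N\sigma_\theta^2}{\sigma^2}\right).
\end{align*}
```

In this special case the information grows only logarithmically in $N$, so each additional observation contributes less than the previous one.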


Similarly, for a random process $\theta(t)$, an algorithm should provide the estimate $\hat{\theta}_N(t)$ and we can define the average entropy of $\theta(t)$ as


$h^*(\theta(t)) = \lim_{t \to \infty} \frac{\int_{-\infty}^{t} h(\theta(u)) \, du}{t}$  (10)


The average mutual information between $\theta(t)$ and $\hat{\theta}_N(t)$ is defined as

$I(\theta(t); \hat{\theta}_N(t)) = h^*(\theta(t)) - h^*(\theta(t) \mid \hat{\theta}_N(t))$  (11)

The major issue of the above generalization of the mutual information measure from hypothesis testing and classification problems to estimation and filtering problems is that the PE needs to evaluate the integral accurately when applying the mutual information measure, which requires the knowledge of the continuous pdf $f(\theta \mid \hat{\theta}_N)$ or $f(\theta(t) \mid \hat{\theta}_N(t))$. In practice, estimating a continuous pdf using a finite number of realizations is an ill-posed problem and one has to assume certain properties (e.g., smoothness, finite dimensional parametric family) of the pdf in order to obtain a unique solution. Alternatively, if we modify the estimation and filtering problems so that the original probability measure is approximated by another measure defined in a finite partition of the parameter space, then the PE only needs to evaluate the pmf of $\theta$ or $\theta(t)$ in a finite discrete space, which makes the algorithm comparison feasible via computer simulations. We will explain this idea next.

3  Empirical Mutual Information and Asymptotic Information Rate

In this section, we convert the performance evaluation of a class of estimation and filtering problems into a classification problem with tolerable distortion of the mutual information based on a properly chosen distortion metric. We call the resulting measure the empirical mutual information, which can be applied to algorithm comparison without the ill-posed issue in density estimation.

A distortion measure is a mapping $d: \Theta \times \hat{\Theta} \to \mathbb{R}^+$ from the set of parameter-estimate pairs into the set of nonnegative real numbers. A commonly used distortion measure is the squared error given by

$d(\theta, \hat{\theta}) = \|\theta - \hat{\theta}\|^2$  (12)

Note that the Euclidean distance between the parameter and its estimate is also a metric since the parameter space is usually a metric space. A distortion measure is said to be bounded if

$\max_{\theta \in \Theta, \, \hat{\theta} \in \hat{\Theta}} d(\theta, \hat{\theta}) < \infty$  (13)

In most cases, the parameter space and the estimate space are the same. We are interested in the partition of the parameter space with a desired distortion bound. An $M$ partition of $\Theta$, i.e., $\Theta = \Theta_1 \cup \cdots \cup \Theta_M$ and $\Theta_i \cap \Theta_j = \emptyset$, $\forall i \neq j$, is said to be $d_M$ bounded if

$\max_{\theta_0, \theta_1 \in \Theta_i} d(\theta_0, \theta_1) < d_M, \quad i = 1, \ldots, M$  (14)

For a bounded parameter space and a bounded distortion measure, one can choose an appropriate value $d_M$ to have a finite partition of the parameter space being $d_M$ bounded. A particular subset $\Theta_i$ of the parameter space represents a hypothesis $i$, while $\hat{\theta} \in \Theta_i$ represents that the estimator chooses the correct hypothesis when $\theta \in \Theta_i$. A good distortion measure often has the property that $d_M$ can be made arbitrarily small even when $M$ is finite. The partition of the parameter space and the associated distortion measure are commonly treated in quantization theory for designing the best partition of the parameter space and the representative value of each region to minimize the expected conditional distortion measure [7, 6]. Here our focus is to convert the estimation and filtering problem into a classification problem so that the mutual information measure can be evaluated to compare the performance among different estimators. One may argue that there exist several good performance measures for the estimation problem such as the mean square error, efficiency, consistency and unbiasedness [15, 1]. However, these measures require that the PE has the knowledge of the likelihood function of the data sequence being generated or the prior distribution of the parameter, which may not be available for some practical problems.

We assume that, for any given value $\theta \in \Theta$, the PE can generate a data sequence of length $N$ and evaluate the estimator using $\hat{\theta}_N$. For any subset $\Theta_i$ of the $M$ partition of $\Theta$, the distortion is $d_M$ bounded. The prior distribution of $\theta$ being in one of the $M$ partitions is given by

$\Pr(\theta \in \Theta_i) = p_i, \quad i = 1, \ldots, M$  (15)

The conditional probability $\Pr(\theta \in \Theta_i \mid \hat{\theta}_N \in \Theta_j)$ is given by

$\Pr(\theta \in \Theta_i \mid \hat{\theta}_N \in \Theta_j) = q_{ij}, \quad i = 1, \ldots, M; \; j = 1, \ldots, M$  (16)

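The empirical entropy $H(\theta)$ is given by

$H(\theta) = -\sum_{i=1}^{M} p_i \log p_i$  (17)

The empirical conditional entropy of $\theta$ given $\hat{\theta}_N$ is

$H(\theta \mid \hat{\theta}_N) = -\sum_{i=1}^{M} \sum_{j=1}^{M} q_{ij} \log q_{ij}$  (18)

The empirical mutual information between $\theta$ and $\hat{\theta}_N$ is

$I(\theta; \hat{\theta}_N) = H(\theta) - H(\theta \mid \hat{\theta}_N)$  (19)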

This empirical mutual information depends on the data length $N$. The approximation accuracy to the true mutual information depends on the partition and the proper choice of $d_M$. The maximum empirical mutual information can be achieved when the estimator always chooses the correct region among the $M$ partitions if the true value of $\theta$ is in that region. If the performance evaluator has additional knowledge about the construction of the estimator in an analytic form, then the asymptotic performance using uniform quantization can also be evaluated [6].
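The construction in (15)-(19) can be sketched in a few lines of Python. This is a Monte Carlo sketch under assumptions not stated in this section: a coin-toss model with five candidate parameter values (the setup used later in Example 3), a uniform partition of [0, 1] into five cells of width 0.2, a uniform prior over the cells, and base-2 logarithms. All helper names are hypothetical. The mutual information is computed from the empirical joint distribution of (true cell, estimated cell), i.e., with the standard weighting of each conditional entropy by the probability of the estimated cell.

```python
import numpy as np

def mutual_information_bits(counts):
    """Mutual information in bits from a table of joint counts."""
    joint = np.asarray(counts, dtype=float)
    joint = joint / joint.sum()
    indep = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    mask = joint > 0                                  # 0 * log 0 is treated as 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

def empirical_mi(estimator, n_tosses, n_trials=20000, seed=0):
    """Empirical mutual information between the true cell and the estimated cell."""
    rng = np.random.default_rng(seed)
    centers = np.array([0.1, 0.3, 0.5, 0.7, 0.9])     # one test value per cell
    edges = np.array([0.2, 0.4, 0.6, 0.8])            # interior cell boundaries
    true_cell = rng.integers(0, centers.size, size=n_trials)   # uniform prior
    heads = rng.binomial(n_tosses, centers[true_cell])         # simulated data
    est_cell = np.digitize(estimator(heads, n_tosses), edges)  # cell of the estimate
    counts = np.zeros((centers.size, centers.size))
    np.add.at(counts, (true_cell, est_cell), 1)       # joint counts (truth, estimate)
    return mutual_information_bits(counts)

def ml_estimator(heads, n):
    return heads / n                                  # relative frequency of heads

def predictive_estimator(heads, n):
    return (heads + 1) / (n + 2)                      # posterior mean under a uniform prior

for n in (5, 20, 50):
    print(n, empirical_mi(ml_estimator, n), empirical_mi(predictive_estimator, n))
```

The cell width and the number of trials are design choices of the PE: a finer partition tightens the distortion bound but needs more trials to populate the joint table reliably.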

For the estimation of a random process, i.e., a filtering problem, the PE can choose a sequence of unknown parameters $\theta^S = \{\theta_1, \ldots, \theta_S\}$ at times $t_1, \ldots, t_S$ as the representative points (samples) of the process and evaluate the distortions at those times based on the $M$ partitions of $\theta_i$ ($i = 1, \ldots, S$). Assuming that at any time $t_i$ the $M$ partition of $\theta_i$ is $d_M$ bounded, the empirical average mutual information is approximated by

$I(\theta(t); \hat{\theta}_N(t)) \approx \frac{1}{S} I(\theta^S; \hat{\theta}_N^S)$  (20)

In the above definition, we also assume that a data sequence of length $N$ is generated at each time $t_i$ ($i = 1, \ldots, S$). Clearly, the empirical mutual information between $\theta_i$ and $\hat{\theta}_i$ depends on $N$. If $\{\theta_1, \ldots, \theta_S\}$ are independent, then we have

$I(\theta(t); \hat{\theta}_N(t)) \approx \frac{1}{S} \sum_{i=1}^{S} I(\theta_i; \hat{\theta}_{i,N})$  (21)

To have a performance measure independent of $N$, we will focus on the asymptotic information gain as $N$ increases. In a classification problem, the information gain from one additional observation is $I(X; X_{N+1}) - I(X; X_N)$. However, as $N$ goes to infinity, the information gain can approach zero. Denote by $\Delta I_N$ the information gain by using $N + 1$ observations instead of $N$ observations. For an estimation problem, we have

$\Delta I_N = I(\theta; \hat{\theta}_{N+1}) - I(\theta; \hat{\theta}_N)$  (22)

For a filtering problem, we have

$\Delta I_N = I(\theta(t); \hat{\theta}_{N+1}(t)) - I(\theta(t); \hat{\theta}_N(t))$  (23)

The mutual information can be computed using the empirical mutual information with a $d_M$ bounded partition. If there exists a value $\beta$ such that

$0 < \lim_{N \to \infty} N^{\beta} \Delta I_N = C(\beta) < +\infty$  (24)

then we define the asymptotic information rate as $\beta$ and the gain as $C(\beta)$.

For an unknown parameter $\theta$ with Gaussian prior distribution being observed under additive white Gaussian noise, we have $\beta = 1$ and $C(\beta) = 0.5$ [8]. For a target with white noise acceleration motion being observed by $N$ sensors with position measurements under additive white Gaussian noises independent across sensors, in the steady state, the centralized estimator yields $\beta = 0.75$ while the distributed estimator using track-to-track fusion without any feedback has $\beta = 0$ [5]. Clearly, a larger value of $\beta$ indicates a better rate of information gain in the asymptotic regime. If two algorithms have the same rate $\beta$, the one with a larger $C(\beta)$ is expected to have a better performance for large $N$. To estimate the asymptotic information rate empirically, the PE needs to compute the increase of the mutual information due to an additional observation as a function of $N$ and use the log plot to find the slope within a certain range for large $N$. We will elaborate on this with additional illustrative examples in the next section.
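The slope-fitting procedure can be illustrated on the Gaussian case cited above, where the mutual information has the closed form $I_N = \frac{1}{2}\ln(1 + N\sigma_\theta^2/\sigma^2)$ in nats (a standard result, assumed here with $\sigma_\theta^2 = \sigma^2 = 1$). The sketch below, with hypothetical function names, computes $\Delta I_N$, fits a straight line to $\log \Delta I_N$ versus $\log N$, and reads $\beta$ from the negated slope and $C(\beta)$ from the intercept. This follows from (24), since $\Delta I_N \approx C(\beta) N^{-\beta}$ for large $N$.

```python
import numpy as np

def gaussian_mi_nats(n, snr=1.0):
    """Closed-form mutual information (nats) after n Gaussian observations.
    snr is the assumed ratio of prior variance to noise variance."""
    return 0.5 * np.log1p(n * snr)

def fit_information_rate(n_values, delta_i):
    """Fit log(delta_i) = log(C) - beta * log(N) by least squares."""
    slope, intercept = np.polyfit(np.log(n_values), np.log(delta_i), 1)
    return -slope, float(np.exp(intercept))

n_values = np.arange(100, 1001)                       # fitting range for large N
delta_i = gaussian_mi_nats(n_values + 1) - gaussian_mi_nats(n_values)
beta, gain = fit_information_rate(n_values, delta_i)
print(beta, gain)
```

With these settings the fit returns values close to $\beta = 1$ and $C(\beta) = 0.5$. When the mutual information comes from simulation rather than a closed form, the same fit would be applied to noisy estimates of $\Delta I_N$, so the choice of fitting range matters.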


4  Illustrative  Examples 

Example 1 (classification) [14]: Consider a classification problem where the PE provides input $x$ to the AD and the AD outputs $y(x)$ so that the PE can compare $y$ with the true class value $t$. The PE used 100 testing samples to evaluate three classifiers A-C with the confusion matrices below. We can see that both classifiers A and B have the same error rate of 10% and classifier C has a larger error rate of 12%. However, classifier A simply guesses that the outcome is 0 for all cases while classifier B makes no error when declaring $y = 0$ and has a 50% chance of being correct when declaring $y = 1$, as opposed to the prior probability $P(t = 1) = 0.1$. Clearly, the PE knows that the mutual information from classifier B is larger than that from classifier A. Using the normalized error to information ratio measure, no matter what $\alpha_N$ the PE chooses, classifier A will always be the worst while classifier B is better than classifier C.

         Classifier A      Classifier B      Classifier C
         y=0    y=1        y=0    y=1        y=0    y=1
  t=0     90      0         80     10         78     12
  t=1     10      0          0     10          0     10

Next, we assume that the classifier can output a '?' indicating that it is not sure whether the input should belong to any of the two classes. The PE wants to compare classifiers D and E with 100 testing samples and the confusion matrix is shown as below. Both classifiers D and E have a 6% error rate and an 11% rejection rate ('?'); however, one should not conclude that they have the same classification performance. Classifier E is just classifier C in disguise: when C declares $y = 1$, E will toss a coin with equal chance of declaring $y = 1$ and $y = ?$. Classifier D is more informative than classifier B: it makes no error when declaring $y = 0$ and has a 60% chance of being correct when declaring $y = 1$. Again, using the normalized error to information ratio, no matter what $\alpha_N$ the PE chooses, classifier D is always better than classifier E. It is in line with our intuition that classifier B performs better than classifier C.

Note that the most informative classifier to the PE does not necessarily provide the best classification accuracy. However, the information theoretic measure is meaningful especially when the AD provides the PE only the classifier itself and not the statistical model that the classification algorithm is built upon.

Example 2 (hypothesis testing): Consider a binary detection problem with the observation sequence given by $z_i = \theta + w_i$, $i = 1, \ldots, N$. The signal $\theta$ is modeled by the following two simple hypotheses $H_0$: $\theta = 0$ vs. $H_1$: $\theta = 1$. The noise sequence is white and $w_i$ follows the double exponential distribution with the pdf

$p(w) = \frac{1}{2} e^{-|w|}$  (25)

One detector uses the likelihood ratio test that computes the test statistic

$T_{1N} = \sum_{i=1}^{N} u_i$  (26)

where

$u_i = \begin{cases} 1, & z_i > 1 \\ 2 z_i - 1, & 0 \le z_i \le 1 \\ -1, & z_i < 0 \end{cases}$  (27)

Another detector uses the sign test

$T_{2N} = \sum_{i=1}^{N} \mathrm{sign}(z_i - 0.5)$  (28)

and compares it with the threshold 0. The sign test is not optimal but it only assumes that the noise pdf is symmetric around zero. The performance evaluator has equal prior probability on the two hypotheses and wants to compare the two detectors using the information theoretic measure. Figure 1 shows the empirical mutual information as a function of the total observations $N$. We can see that the likelihood ratio detector has a slightly better performance than the sign detector and the performance gap decreases as $N$ increases. When $N$ approaches infinity, both detectors achieve the maximum information of 1, which implies no detection error. Note that, under small $N$ (low SNR regime), the sign detector has mutual information close to that of the likelihood ratio detector. This is consistent with the standard analysis that the sign detector is locally optimal [9]. In this example, we can see that the sign detector does not lose much performance compared with the optimal likelihood ratio detector.
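A Monte Carlo sketch of the comparison in Example 2 is given below. It is not the authors' simulation code: the function names, number of trials, seed and the use of base-2 logarithms are assumptions. The noise is taken to be Laplace with unit scale, which is the scale for which the per-sample statistic $u_i$ equals the log-likelihood ratio, and both statistics are compared with the threshold 0 under equal priors. Odd values of $N$ are used in the usage lines so that the sign statistic cannot tie at zero.

```python
import numpy as np

def mutual_information_bits(counts):
    """Mutual information in bits from a table of joint counts."""
    joint = np.asarray(counts, dtype=float)
    joint = joint / joint.sum()
    indep = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    mask = joint > 0                                   # 0 * log 0 is treated as 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

def simulate_detectors(n_obs, n_trials=20000, seed=0):
    """Empirical mutual information of the likelihood ratio and sign detectors."""
    rng = np.random.default_rng(seed)
    truth = rng.integers(0, 2, size=n_trials)          # equal priors on H0 and H1
    noise = rng.laplace(loc=0.0, scale=1.0, size=(n_trials, n_obs))
    z = truth[:, None] + noise                         # z_i = theta + w_i
    u = np.clip(2.0 * z - 1.0, -1.0, 1.0)              # piecewise statistic u_i
    decisions = {
        "lrt": (u.sum(axis=1) > 0).astype(int),
        "sign": (np.sign(z - 0.5).sum(axis=1) > 0).astype(int),
    }
    results = {}
    for name, decided in decisions.items():
        counts = np.zeros((2, 2))
        np.add.at(counts, (truth, decided), 1)         # joint counts (truth, decision)
        results[name] = mutual_information_bits(counts)
    return results

for n in (1, 5, 21):
    print(n, simulate_detectors(n))
```

For $N = 1$ the two detectors make identical decisions, since both reduce to comparing $z_1$ with 0.5, so any difference between them appears only for larger $N$.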

Figure 1: Comparison of the empirical mutual information between the likelihood ratio detector and the sign detector.

Example 3 (parameter estimation): A coin has a probability $p$ of coming up heads which is unknown. The performance evaluator tosses the coin $N$ times and $M$ heads have occurred. If the performance evaluator has a uniform prior (corresponding to a Beta distribution $B(1, 1)$) on $p$, then the posterior of $p$ is $B(M + 1, N - M + 1)$ where $B(m, n)$ denotes the Beta distribution with pdf

$f(p) = \frac{\Gamma(m + n)}{\Gamma(m)\Gamma(n)} p^{m-1} (1 - p)^{n-1}$  (29)

The mutual information is the difference of the differential entropy between the prior and the posterior on $p$. This Bayesian procedure requires that the performance evaluator has the knowledge of the likelihood function and the inference on $p$ is summarized by the whole posterior density.

If the performance evaluator does not have the complete knowledge of the likelihood function and the algorithm developer only provides a point estimate on $p$, then the posterior density cannot be fully specified. In this case, the empirical mutual information is helpful for performance comparison. Let us assume that one estimator gives $\hat{p}_1 = \frac{M}{N}$ and another estimator gives $\hat{p}_2 = \frac{M + 1}{N + 2}$. The performance evaluator desires to have $|\hat{p} - p| < 0.1$ as the distortion bound. Five possible values $\{0.1, 0.3, 0.5, 0.7, 0.9\}$ of $p$ are used to generate $N$ tosses to evaluate the performance of the two estimators. Figure 2 shows the empirical mutual information as a function of $N$ for the two estimators. We can see that for small $N$, the maximum likelihood estimator $\hat{p}_1$ (Estimator 1) has slightly larger mutual information than the other estimator $\hat{p}_2$ (Estimator 2), which is the Bayesian predictive probability that the next toss will be a head after seeing $M$ heads in $N$ tosses. Note that the result does not say that $\hat{p}_1$ yields better estimation accuracy than $\hat{p}_2$; however, the performance evaluator can interpret it this way: "If $p$ can only take five possible values, I will gain more information from estimator $\hat{p}_1$ than from estimator $\hat{p}_2$." If the performance evaluator chooses $\alpha_N = 1$, then the normalized error to information ratio also reveals that $\hat{p}_1$ performs better than $\hat{p}_2$. In fact, the assumptions made in $\hat{p}_1$ seem closer to the performance evaluation procedure and the results are as expected.

Figure 2: Comparison of the empirical mutual information between the two estimators.

Example 4 (joint classification and estimation): Consider a communication problem with the following observation equation

$z_i = \eta x_i + n_i, \quad i = 1, \ldots, N$  (30)

where $x_i \in \{-1, 1\}$ is the message to be transmitted; $\eta \in (0, +\infty)$ is an unknown fading coefficient; and $n_i \sim \mathcal{N}(0, \sigma^2)$ is the additive white Gaussian noise. Given the observation sequence, one has to decode the message $\{x_i\}$ and estimate the fading parameter $\eta$ jointly. In communication, one often cares only about the decoding performance; however, some joint classification and estimation problems with a similar setup may require evaluating the decision and estimation errors simultaneously. Assume that the message sequence is a Markov chain with known transition probability; then the maximum likelihood estimate of both $\{x_i\}$ and $\eta$ is

$\{\hat{x}_i\}, \hat{\eta} = \arg\min_{\{x_i\}, \eta} \sum_{i=1}^{N} \left[ (z_i - \eta x_i)^2 - \log P(x_i \mid x_{i-1}) \right]$  (31)

which is computationally prohibitive if one has to examine all possible message sequences. An efficient method, denoted by Algorithm A, makes the decision first using

$\hat{x}_i = \mathrm{sign}(z_i)$  (32)

and then estimates $\eta$ using least squares assuming that each decoded message is correct. Alternatively, Algorithm B performs the estimation directly with


p  = 


\ 


N 


Viv- 


(33) 
and the classification of the message is made by

$\{\hat{x}_i\} = \arg\min_{\{x_i\}} \sum_{i=1}^{N} \left[ (z_i - \hat{\eta} x_i)^2 - \log P(x_i \mid x_{i-1}) \right]$  (34)


using the Viterbi algorithm. By applying the empirical mutual information measure, we found that Algorithm B is better than A when $P(x_i \mid x_{i-1}) = \begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix}$ and $N$ and the noise variance $\sigma^2$ are large enough. This agrees with our intuition that an incorrect decision using the sign detector also leads to poor estimation accuracy of the fading parameter. Thus the proposed measure can also be meaningful to evaluate algorithms developed for other joint classification and estimation problems with more complex structure where classification or estimation performance alone may not be the only focus of the PE.
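
Algorithm A of Example 4 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the simulated Markov message source and the parameter values are assumptions. With the decisions $\hat{x}_i = \mathrm{sign}(z_i)$ treated as correct, the least squares estimate of $\eta$ minimizes $\sum_i (z_i - \eta \hat{x}_i)^2$, which gives $\hat{\eta} = \frac{1}{N} \sum_i z_i \hat{x}_i = \frac{1}{N} \sum_i |z_i|$ because $\hat{x}_i^2 = 1$.

```python
import numpy as np

def simulate_channel(n, eta, sigma, stay_prob=0.9, seed=0):
    """Generate a +/-1 Markov message and its faded, noisy observations."""
    rng = np.random.default_rng(seed)
    flips = rng.random(n) >= stay_prob          # True where the message changes sign
    flips[0] = False                            # the first symbol has no predecessor
    x0 = rng.choice([-1, 1])                    # random initial symbol
    x = x0 * np.where(np.cumsum(flips) % 2 == 0, 1, -1)
    z = eta * x + sigma * rng.normal(size=n)    # z_i = eta * x_i + n_i
    return x, z

def algorithm_a(z):
    """Sign decisions followed by least squares for the fading coefficient."""
    x_hat = np.where(z >= 0, 1, -1)             # hard decision on each symbol
    eta_hat = float(np.mean(z * x_hat))         # least squares given the decisions
    return x_hat, eta_hat

for sigma in (0.2, 2.0):
    x, z = simulate_channel(5000, eta=1.0, sigma=sigma)
    x_hat, eta_hat = algorithm_a(z)
    print(sigma, np.mean(x_hat != x), eta_hat)
```

In this sketch the estimate stays close to the true $\eta = 1$ at low noise but is biased upward at high noise, because every decision error flips the sign of a term in the average. This is the mechanism the text refers to when it links incorrect sign decisions to poor estimation accuracy.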

Example 5 (joint classification and filtering): Consider a dynamic system with state equation

$x_k = F_j x_{k-1} + v_{k-1}$  (35)


where  Fi  = 


1  0 
0  1 


and  F2  = 


The noise $v_k \sim \mathcal{N}(0, Q)$ is a white Gaussian sequence with $Q =$

1  1 

f  2  I  cr,^.  The  observation  model  is 

2  ^ 


Zk  =  FIxk  +  Wk 


(36) 


where  F[  =  [  1  0  ]  and  Wk  ~  A/’(0,  ct^)  is  white 
Gaussian  sequence  independent  of  Vk-  We  are  inter¬ 
ested  in  sequentially  estimate  the  state  Xk  and  clas¬ 
sify  the  dynamic  model  Fj.  Denote  by  the  model 
at  time  k  and  assume  that  the  model  sequence  Mk 
is  a  Markov  chain  with  transition  probability  matrix 


P{Mk\Mk-i) 


0.9  0.1 
0.1  0.9 


The  system  is  of  linear 


Markov  jump  type  which  can  be  extended  to  handle 
maneuvering  target  tracking  [2]  and  joint  target  classi¬ 
fication  and  tracking  [4]  problems. 
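
To make the test scenario concrete, the following Python sketch generates a model sequence, a state trajectory and scalar measurements for a two-model jump linear system of this form. It is only an illustration: the second transition matrix, the initial state, and the use of independent process noise components are assumptions chosen for the example (the source values of $F_2$ and $Q$ are not legible), and the function names are hypothetical.

```python
import random

def sample_markov_chain(n, trans, rng, start=0):
    """Draw model indices M_1..M_n; M_1 is drawn from row `start` of trans."""
    seq, state = [], start
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for nxt, prob in enumerate(trans[state]):
            acc += prob
            if u < acc:
                state = nxt
                break
        seq.append(state)
    return seq

def simulate(n, models, noise_std, sigma_w, trans, rng):
    """Simulate x_k = F_j x_{k-1} + v_{k-1} and z_k = x_k[0] + w_k.

    Assumptions: 2-dimensional state starting at the origin, and independent
    process noise components with standard deviations noise_std[0] and
    noise_std[1] (a stand-in for the covariance Q of the paper).
    """
    x = [0.0, 0.0]
    modes = sample_markov_chain(n, trans, rng)
    states, measurements = [], []
    for j in modes:
        F = models[j]
        v = [rng.gauss(0.0, noise_std[0]), rng.gauss(0.0, noise_std[1])]
        x = [F[0][0] * x[0] + F[0][1] * x[1] + v[0],
             F[1][0] * x[0] + F[1][1] * x[1] + v[1]]
        states.append(x)
        measurements.append(x[0] + rng.gauss(0.0, sigma_w))
    return modes, states, measurements

TRANS = [[0.9, 0.1], [0.1, 0.9]]
F1 = [[1.0, 0.0], [0.0, 1.0]]
F2_HYPOTHETICAL = [[1.0, 1.0], [0.0, 1.0]]  # illustrative only, not from the paper

modes, states, measurements = simulate(
    200, [F1, F2_HYPOTHETICAL], (1.0, 1.0), 1.0, TRANS, random.Random(0))
```

Because the true `modes` and `states` are known to the evaluator in such a simulation, both the classification decisions and the state estimates of a candidate algorithm can be scored against them.
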

Clearly, the conditional density of the state $x_k$ is a Gaussian mixture

$p(x_k \mid Z^k) = \sum_{l=1}^{2^k} p(x_k \mid M^{k,l}, Z^k)\, P(M^{k,l} \mid Z^k)$  (37)


with an exponentially increasing number of components. The probability $P(M^{k,l} \mid Z^k)$ of a model history can be obtained recursively using Bayes' formula [1]. Consider Algorithm A, which always keeps the 16 model history sequences with the largest probabilities, discards the rest of the sequences, and renormalizes the probabilities. It can be interpreted as multiple hypothesis tracking (MHT) [2] with hypothesis pruning. Algorithm B combines the histories of models and keeps only the possible models in the last two sampling periods. It requires 4 filters operating in parallel and is called the generalized pseudo-Bayesian estimator of order 2 (GPB2) [1]. Algorithm C uses the interacting multiple model (IMM) estimator [1] and makes decisions sequentially according to the approximate model probability $P(M_k \mid Z^k)$. We assume that $\sigma_v = \sigma_w = 1$ and compute the asymptotic information rate for Algorithms A-C. Based on the approximate slope from a state and model sequence of length 200, we obtained $P_A = 0.112$, $P_B = 0.088$ and $P_C = 0.084$. This is in line with the existing results reported on state estimation accuracy [1]: IMM has tracking error comparable to GPB2 (but with a lower computational requirement), and MHT with hypotheses of longer time history can further improve the tracking accuracy. Interestingly, Algorithm B has a slightly smaller average classification error on the choice of the model at each time than Algorithm C. This is also reflected in the small difference between their asymptotic information rates.
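
The hypothesis management of Algorithm A (extend every model history by one step with Bayes' formula, keep the most probable ones, renormalize) can be sketched as follows. This is a minimal sketch rather than the authors' code: the `likelihood` callable is a hypothetical placeholder that should return the measurement likelihood of $z_k$ for a given model history (in practice supplied by a Kalman filter matched to that history), and the dictionary representation is an assumption made for brevity.

```python
def extend_hypotheses(hypotheses, trans, likelihood):
    """One Bayes step over model histories.

    hypotheses: dict mapping a history tuple (M_1, ..., M_{k-1}) to its
                posterior probability given the past measurements.
    trans:      trans[a][b] = P(M_k = b | M_{k-1} = a).
    likelihood: callable(history) -> likelihood of the newest measurement
                under that history (hypothetical interface).
    """
    extended = {}
    for hist, prob in hypotheses.items():
        for j, p_trans in enumerate(trans[hist[-1]]):
            new_hist = hist + (j,)
            extended[new_hist] = likelihood(new_hist) * p_trans * prob
    total = sum(extended.values())
    return {h: p / total for h, p in extended.items()}

def prune_hypotheses(hypotheses, keep=16):
    """Keep the `keep` most probable histories and renormalize."""
    top = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    total = sum(p for _, p in top)
    return {hist: p / total for hist, p in top}
```

Without pruning the number of histories doubles at every step for two models, which is the exponential growth noted above; calling `prune_hypotheses` after each `extend_hypotheses` caps the count at 16.
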


5  Relation to Other Performance Measures

Mutual information was originally proposed to characterize the capacity of a communication channel [6]. Its extension to evaluating an estimator is usually based on the Fisher information, which is related to the Cramér-Rao lower bound on the mean square estimation error [1]. Another connection between the mutual information and the mean square estimation error in a Gaussian channel was discussed in [8]. However, these information theoretic measures are algorithm independent. For evaluating the quality of a classifier, a complete error rate vs. rejection rate curve is usually generated for each classifier. Some authors use the area under this curve to rank different classifiers [14]. An alternative metric for classifier ranking in the Neyman-Pearson paradigm was proposed in [16]. There exist abundant performance measures for the evaluation of estimation algorithms [17, 3, 13]. However, practitioners often use the mean square estimation error to rank estimation and filtering algorithms. In addition, if an algorithm also provides a self-assessment of its mean square estimation error, the evaluation of this additional information often requires a credibility test [11, 12]. There is no comprehensive measure of the credibility of an estimator except the noncredibility index (NCI) proposed in [10]. However, NCI cannot be easily extended to evaluate the credibility of a classifier, even when the classifier can provide a self-assessment of its classification accuracy in terms of the confusion matrix. There are inherent difficulties in evaluating the performance of joint classification and estimation algorithms. Estimation error and classification error are often incompatible, and the performance evaluator may not have access to a precise description of the algorithm, but only to its behavior on test examples. Thus empirical mutual information and asymptotic information rate are important indicators that let the performance evaluator meaningfully judge an algorithm's quality or compare the performance of different algorithms through carefully controlled scenarios. There is no need to treat the decision and estimation problems separately. On the other hand, we do not intend to replace the existing performance measures or metrics with information theoretic measures, but to complement them in the algorithm evaluation and comparison problems of practical interest.
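
As a small illustration of how an information measure can be computed from test examples alone, the sketch below evaluates the standard plug-in estimate of the mutual information between true and decided class labels from a table of counts (a confusion matrix). It is offered as an assumption-laden example: the plug-in estimator is a textbook choice and may differ in detail from the empirical mutual information defined earlier in the paper, and the function name is hypothetical.

```python
import math

def mutual_information_bits(counts):
    """Plug-in mutual information (in bits) from a table of joint counts.

    counts[i][j] = number of test cases with true class i and decision j.
    Cells with zero count contribute nothing to the sum.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n > 0:
                mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi
```

For two equally likely classes, a classifier that is always right yields 1 bit and one whose decisions are independent of the truth yields 0 bits, so the value summarizes the whole confusion matrix in a single number that needs no access to the algorithm's internals.
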

6  Discussions  and  Conclusions 

In this paper, we studied performance evaluation using several newly derived information theoretic measures, namely, the empirical mutual information, the normalized error to information ratio, and the asymptotic information rate, for classification, estimation and filtering problems. They serve as a guideline for designing a practical procedure to measure the performance of different algorithms with limited knowledge of the parametric model on which an algorithm developer relies. Several practical examples, including joint decision and estimation, are used for algorithm comparison and for gaining insight into some inherent difficulties of algorithm ranking. In most cases, information theoretic measures do rank the performance of the candidate algorithms properly, even in the joint classification and estimation problem where classification or estimation accuracy alone does not provide the complete picture of algorithm performance.